-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: ensure Replica objects never change replicaID #40751
storage: ensure Replica objects never change replicaID #40751
Conversation
Hold up on this, I introduced an with a recent revision. I'll post when RFAL. |
39392d4
to
0b9ba31
Compare
1ec375a
to
53562b2
Compare
Okay this is RFAL. It's undoubtedly going to take some iteration. The testing work remains but I think the locking is now sound. |
53562b2
to
30ef8d5
Compare
Not reviewing due to vacation, but this sounds very exciting. If course it's unfortunate that we have to do this Uber release pressure, but.. well I don't know that I'm really convinced that it's unfortunate. This is a major reduction in tech debt that I'm unapologetically happy to see. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very scary change to bring in at this late date. But I agree with @tbg that it's overall moving in the right direction, and I think it's probably better to do this than to try to go forward with learner replicas without this. I think it's a choice between this change and reverting learner replicas for this release.
My biggest concern is the transition: if we're moving to learner replicas by default, the interactions of all these changes with preemptive snapshots will almost certainly be under-tested.
Reviewed 24 of 24 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/client_merge_test.go, line 1666 at r1 (raw file):
repl, err := store1.GetReplica(rangeID) if err != nil { continue
This looks wrong and at least needs a comment. If we're now expecting an error here we should use a testutil assertion to make sure we're getting the specific error we want.
pkg/storage/client_merge_test.go, line 2039 at r1 (raw file):
defer leaktest.AfterTest(t)() t.Skip("these invariants are no longer true")
Remember to delete any tests that are skipped because they no longer make sense before merging the PR (i like the skip-with-comment pattern for ease of early code reviewing)
pkg/storage/client_merge_test.go, line 2883 at r1 (raw file):
// mtcStoreRaftMessageHandler exists to allows a store to be stopped and // restarted while maintaining a partition using an unreliableRaftHandler. type mtcStoreRaftMessageHandler struct {
This is only used in client_raft_test.go. Maybe move this and unreliableRaftHandler to a client_raft_helpers_test.go?
pkg/storage/client_raft_test.go, line 1196 at r1 (raw file):
// This used to fail with IntersectingSnapshotMsg because we relied on replica GC // to remove the LHS and that queue is disable. Now we will detect that the LHS is
s/disable/disabled/
pkg/storage/client_replica_gc_test.go, line 151 at r1 (raw file):
func TestReplicaGCQueueDropReplicaGCOnScan(t *testing.T) { defer leaktest.AfterTest(t)() t.Skip("the range will be removed synchronously now")
This test should probably be revived to prove that replicas that aren't GC'd synchronously will eventually be found asynchronously.
pkg/storage/client_split_test.go, line 3318 at r1 (raw file):
_, err := tc.AddReplicas(kRHS, tc.Target(1)) if !testutils.IsError(err, `snapshot intersects existing range|was not found on`) { t.Fatalf(`expected snapshot intersects existing range" error got: %+v`, err)
Update the fatal message too. Are both error messages now possible or is it reliably "was not found" now?
pkg/storage/client_test.go, line 1250 at r1 (raw file):
return retry.ForDuration(testutils.DefaultSucceedsSoonDuration, func() error { _, err := m.stores[dest].GetReplica(rangeID) if err == nil {
Assert on the specific expected error here.
pkg/storage/client_test.go, line 1270 at r1 (raw file):
log.VEventf(context.TODO(), 1, "engine %d: missing key %s", i, key) } else { var err error
Reducing the scope of the error here seems like a good practice, but I'm curious what prompted this specific case.
pkg/storage/queue.go, line 1179 at r1 (raw file):
} // Replica not found, remove from set and try again.
s/not found/not found or was recreated with a new replica ID/
pkg/storage/replica.go, line 494 at r1 (raw file):
// ReplicaID returns the ID for the Replica. It may be zero if the replica does // not know its ID.
Add comment that the ID may change from zero to non-zero, but once it has a non-zero value it will not change again.
pkg/storage/replica_application_result.go, line 310 at r1 (raw file):
} // TODO(ajwerner): the data should already be destroyed at this point. // destroying the data again seems crazy.
Then why is it here? The "as late as possible" bit from the old code no longer applies (it was there because of the descriptor lookup in shouldQueue; we want to wait for the lease holder to apply the command).
pkg/storage/replica_application_state_machine.go, line 525 at r1 (raw file):
) bool { _, existsInDesc := desc.GetReplicaDescriptor(storeID) // NB: if we're catching up from a preemptive snapshot then we won't
It'll be nice to eventually remove these last vestiges of preemptive snapshots.
pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):
res := cmd.replicatedResult() // changeReplicasPreApply detects if this command will remove us from the range.
There's nothing named changeReplicasPreApply.
pkg/storage/replica_application_state_machine.go, line 540 at r1 (raw file):
// changeReplicasPreApply detects if this command will remove us from the range. // If so we delete stage the removal of all of our range data into this batch.
s/delete //
pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):
// accepted. The replica will be destroyed in handleChangeReplicas. b.r.mu.Lock() b.r.mu.destroyStatus.Set(
It's not obvious that raftMu is held here, as required by the new comment.
pkg/storage/replica_destroy.go, line 128 at r1 (raw file):
// removeUninitializedReplica is called when we know that an uninitialized // replica has been removed and re-added as a different replica. We're safe
Add more details here: after the first "replica", add "(which knows its replica ID but not its key range)", and after the second "replica" add "(with a new replica ID)").
pkg/storage/replica_destroy.go, line 130 at r1 (raw file):
// replica has been removed and re-added as a different replica. We're safe // to GC its hard state because nobody cares about our votes anymore. The // sad thing is we aren't safe to GC the range's data because we don't know
So will this sometimes leak data? Or is it guaranteed that in the case of the split the LHS will be able to take care of it? It's bad to introduce anything that could permanently leak data (we effectively had this before, but it was at least discoverable by scanning the range-id-local keyspace. If we delete the range ID local keys but leak regular data, it would be harder to discover and track).
pkg/storage/replica_init.go, line 220 at r1 (raw file):
r.mu.minReplicaID = replicaID + 1 // Reset the raft group to force its recreation on next usage.
Update or remove this comment.
pkg/storage/replica_init.go, line 232 at r1 (raw file):
r.mu.internalRaftGroup = nil // If there was a previous replica, repropose its pending commands under
This was here for a reason; I think we need to be careful when dropping and recreating a Replica that we handle its pending commands properly. I don't think we have to repropose them, but they should at least get an error instead of being dropped on the floor.
pkg/storage/replica_learner_test.go, line 439 at r1 (raw file):
tc.RemoveReplicasOrFatal(t, scratchStartKey, tc.Target(1)) // remove learner // We need to wait until the range has been GC'd on server 1 before // we know that it can be safely added again.
What does "safely" mean here? Just that adding it again will fail without breaking anything? If so, s/safely/successfully/.
pkg/storage/replica_raft.go, line 1556 at r1 (raw file):
ctx context.Context, raftCmd storagepb.RaftCommand, ) (func(), error) { // There's a super nasty case where we're shutting down and we need to wait
What happens in this case? Should this be a TODO(ajwerner)
?
pkg/storage/store.go, line 2357 at r1 (raw file):
log.Fatalf(ctx, "refusing to clear replicated data for an initialized range %v in SplitRange", rightRepl) } // We want to delete this range's data but not its hard state as it
How could an uninitialized range have data covered by clearReplicatedOnly?
pkg/storage/store.go, line 2708 at r1 (raw file):
defer s.mu.Unlock() // We need to launch an async task to remove this uninitialized replica.
Is this a stale comment? I'm not seeing the async task. It's also strange to see these sanity checks after the call to destroyUninitializedReplicaRafMuLocked
.
pkg/storage/store.go, line 3581 at r1 (raw file):
return roachpb.NewError(err) } // We need to make sure we actually stepped that message.
Shouldn't this be conveyed by stepRaftGroup instead of a separate method called afterwards?
pkg/storage/store.go, line 3643 at r1 (raw file):
storeID := repl.store.StoreID() // TODO(ajwerner): use a better error than RangeNotFound.
This will require adding more error cases in various places, which will probably have backwards-compatibility implications.
pkg/storage/store.go, line 3801 at r1 (raw file):
elapsed.Seconds(), stats.entriesProcessed, stats.batchesProcessed, stats.stateAssertions) } // TODO(ajwerner): When does this case happen?
Uninitialized replicas will process raft Readies in order to vote (perhaps among other things, but mainly for voting)
pkg/storage/store_snapshot.go, line 647 at r1 (raw file):
// withReplicaForRequest ensures that requests with a non-zero replica // id are passed to a replica with a matching id. Given this is not a // preemptive snapshot we know that its id must be non-zero.
How do we know it's not preemptive? I thought this code path was reached for any kind of snapshot.
pkg/storage/store_snapshot.go, line 740 at r1 (raw file):
return crdberrors.AssertionFailedf(`expected a raft or learner snapshot`) } r := retry.StartWithCtx(ctx, retry.Options{
This retry loop worries me because it suggests a new range of possible failure modes here. Have we marked exactly the correct set of error cases with errRetry? What are the consequences of getting this wrong in either direction?
pkg/storage/store_snapshot.go, line 762 at r1 (raw file):
} // shouldAcceptSnapshotData is an optimization to check whether we should even
Move this comment to the method it describes.
pkg/storage/store_snapshot.go, line 801 at r1 (raw file):
existingRepl.mu.RUnlock() // shouldDestroyExisting is set to true if after the switch the existingRepl
There's nothing called shouldDestroyExisting.
pkg/storage/store_snapshot.go, line 852 at r1 (raw file):
// TODO(ajwerner): consider making this an assertion or using a better error. // We only ever snapshot from leaseholders. Maybe it's possible for an // old leaseholder to send an old snapshot. It'd be nice if we had a
Yes, I think old leaseholders are the reason for this case.
pkg/storage/store_snapshot.go, line 871 at r1 (raw file):
// move backwards. existingDesc = existingRepl.mu.state.Desc if existingRepl.mu.destroyStatus.RemovalPending() {
Should a pending removal be an errRetry here? Comment why we're doing a second check of destroyStatus.
pkg/storage/store_test.go, line 573 at r1 (raw file):
DestroyData: true, }); err != nil { t.Fatalf("didn't expected error re-removing same range: %v", err)
s/ed//
30ef8d5
to
3eada93
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I took a first pass at your comments but need to keep working on testing. There are a couple scenarios that need new tests and a couple of disabled tests which definitely need to be updated.
I completely share your concern about mixed version clusters. I'm hopeful that it's generally going to be okay because in pre-learners we take care to not add replicas until we've successfully sent a pre-emptive snapshot which tends to mitigate many of these add and remove scenarios.
I'm all ears if you have more testing ideas.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/client_merge_test.go, line 1666 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This looks wrong and at least needs a comment. If we're now expecting an error here we should use a testutil assertion to make sure we're getting the specific error we want.
Definitely it's wrong. I started mucking around with this test and then decided to skip it. Will update.
pkg/storage/client_merge_test.go, line 2039 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Remember to delete any tests that are skipped because they no longer make sense before merging the PR (i like the skip-with-comment pattern for ease of early code reviewing)
Will do, not quite there yet. I want to revisit each of these tests individually.
pkg/storage/client_merge_test.go, line 2883 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This is only used in client_raft_test.go. Maybe move this and unreliableRaftHandler to a client_raft_helpers_test.go?
Good idea. I'm also going to type up a simple abstraction to make these things easier to set up and use.
pkg/storage/client_raft_test.go, line 1196 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/disable/disabled/
Done.
pkg/storage/client_replica_gc_test.go, line 151 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This test should probably be revived to prove that replicas that aren't GC'd synchronously will eventually be found asynchronously.
Agreed, will get to it next.
pkg/storage/client_split_test.go, line 3318 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Update the fatal message too. Are both error messages now possible or is it reliably "was not found" now?
Updated the message and added a comment. They both now happen.
pkg/storage/client_test.go, line 1250 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Assert on the specific expected error here.
Done.
pkg/storage/client_test.go, line 1270 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Reducing the scope of the error here seems like a good practice, but I'm curious what prompted this specific case.
Reverted. During an intermediate commit I had created a readBytesFromEngine
method and this code was moved to part of a helper that returned []roachbpb.Value
.
pkg/storage/queue.go, line 1179 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/not found/not found or was recreated with a new replica ID/
Done.
pkg/storage/replica.go, line 494 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Add comment that the ID may change from zero to non-zero, but once it has a non-zero value it will not change again.
Done.
pkg/storage/replica_application_result.go, line 310 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Then why is it here? The "as late as possible" bit from the old code no longer applies (it was there because of the descriptor lookup in shouldQueue; we want to wait for the lease holder to apply the command).
It's here as a hack. The applied state key gets written after the clearing of the data. Maybe other keys do too. This TODO definitely needs to get addressed before merging this change.
pkg/storage/replica_application_state_machine.go, line 525 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
It'll be nice to eventually remove these last vestiges of preemptive snapshots.
Agreed. That will need to wait until 20.1.
pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
There's nothing named changeReplicasPreApply.
right I inlined that method here, updated
pkg/storage/replica_application_state_machine.go, line 540 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/delete //
Done.
pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
It's not obvious that raftMu is held here, as required by the new comment.
everything underneath application holds the raftMu. Added more commentary.
pkg/storage/replica_application_state_machine.go, line 610 at r1 (raw file):
// merge transaction commits. // TODO(ajwerner): we know we're going to get something here for the right
re: the interactions of all these changes with preemptive snapshots will almost certainly be under-tested.
This case should only arise in the presence of a preemptive snapshot. I don't think it's handled correctly here. I suspect that the right course of action if we are trying to apply a merge while not part of the range is to completely destroy the preemptive snapshot.
Agreed. That will need to wait until 20.1.
I think I definitely need to write a test to exercise this specific case before this merges. The logic here remains exposed to the hazard.
pkg/storage/replica_destroy.go, line 128 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Add more details here: after the first "replica", add "(which knows its replica ID but not its key range)", and after the second "replica" add "(with a new replica ID)").
Done.
pkg/storage/replica_destroy.go, line 130 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
So will this sometimes leak data? Or is it guaranteed that in the case of the split the LHS will be able to take care of it? It's bad to introduce anything that could permanently leak data (we effectively had this before, but it was at least discoverable by scanning the range-id-local keyspace. If we delete the range ID local keys but leak regular data, it would be harder to discover and track).
Updated the comment to make it less scary. Indeed we'll always end up removing the data. I was less generally confident of that when this was first typed.
pkg/storage/replica_init.go, line 220 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Update or remove this comment.
Done.
pkg/storage/replica_init.go, line 232 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This was here for a reason; I think we need to be careful when dropping and recreating a Replica that we handle its pending commands properly. I don't think we have to repropose them, but they should at least get an error instead of being dropped on the floor.
I think they will get an error. Both replica removal paths call cancelPendingCommandsLocked()
pkg/storage/replica_learner_test.go, line 439 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What does "safely" mean here? Just that adding it again will fail without breaking anything? If so, s/safely/successfully/.
Done.
pkg/storage/replica_raft.go, line 1556 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What happens in this case? Should this be a
TODO(ajwerner)
?
This is stale. I typed this comment before I made removal synchronous in getOrCreateReplica()
. In fact it was this case that motivated the synchronous removal as opposed to using the queue. If we were shutting down and called getOrCreateReplica()
underneath raft we'd crash if we returned an error but nothing would ever run the queue.
Fixed.
pkg/storage/store.go, line 2357 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
How could an uninitialized range have data covered by clearReplicatedOnly?
Notice that the descriptor is from the split, not the uninitialized replica. The LHS had the data and now the RHS doesn't know about it. If we didn't clear the data then it could be leaked. I've updated the comment.
This needs a targeted test before this change merges.
pkg/storage/store.go, line 2708 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Is this a stale comment? I'm not seeing the async task. It's also strange to see these sanity checks after the call to
destroyUninitializedReplicaRafMuLocked
.
All around, yes I agree. Sanity checks moved up
pkg/storage/store.go, line 3581 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Shouldn't this be conveyed by stepRaftGroup instead of a separate method called afterwards?
Yes it should. I've added a return value to stepRaftGroup and also to withRaftGroup. It potentially adds some cruft to the code but it at least forces callers to think about destroyStatus.
pkg/storage/store.go, line 3643 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This will require adding more error cases in various places, which will probably have backwards-compatibility implications.
Removed the TODO, added a NB
pkg/storage/store.go, line 3801 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Uninitialized replicas will process raft Readies in order to vote (perhaps among other things, but mainly for voting)
Thanks. I left this for myself to come back to a while ago.
pkg/storage/store_snapshot.go, line 647 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
How do we know it's not preemptive? I thought this code path was reached for any kind of snapshot.
See the assertion at the top of the method. canApplyPreemptiveSnapshot
is now a different method.
pkg/storage/store_snapshot.go, line 740 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This retry loop worries me because it suggests a new range of possible failure modes here. Have we marked exactly the correct set of error cases with errRetry? What are the consequences of getting this wrong in either direction?
errRetry should only be returned if the current replica to which this is being addressed has a true value forIsRemovalPending()
.
I'm increasingly convinced that this method could use withReplicaForRequest
. The logic is very similar and the larger store mutex footprint here doesn't necessarily seem justified. It seems okay to grab the mutex before calling checkSnapshotOverlapLocked
pkg/storage/store_snapshot.go, line 762 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Move this comment to the method it describes.
Done.
pkg/storage/store_snapshot.go, line 801 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
There's nothing called shouldDestroyExisting.
That was stale, updated.
pkg/storage/store_snapshot.go, line 852 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yes, I think old leaseholders are the reason for this case.
Done.
pkg/storage/store_snapshot.go, line 871 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Should a pending removal be an errRetry here? Comment why we're doing a second check of destroyStatus.
Yes, pending means another goroutine is synchronously removing this replica. I augmented the comment above to clarify why we're checking the destroy status again.
pkg/storage/store_test.go, line 573 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/ed//
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still working my way through, but about to get pulled away from the review for a bit, so publishing what I've looked at so far.
Reviewed 6 of 24 files at r1, 8 of 18 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/queue.go, line 62 at r2 (raw file):
// processing state. type replicaItem struct { value roachpb.RangeID
nit: s/value/rangeID/ in its own commit.
pkg/storage/queue.go, line 1174 at r2 (raw file):
repl, _ := bq.getReplica(item.value) if repl != nil { if item.replicaID == 0 || item.replicaID == repl.ReplicaID() {
nit: pull this into a sameReplica
helper with a little description about the invariant it's meant to uphold.
pkg/storage/queue.go, line 1174 at r2 (raw file):
repl, _ := bq.getReplica(item.value) if repl != nil { if item.replicaID == 0 || item.replicaID == repl.ReplicaID() {
This could use a test in queue_test.go
.
pkg/storage/replica.go, line 494 at r2 (raw file):
// ReplicaID returns the ID for the Replica. It may be zero if the replica does // not know its ID. One a Replica has a non-zero ReplicaID it will never change.
s/One/Once/
Also, minor nit but this definition seems a little high in this file. It's above the interface assertions and the constructor. Perhaps move above Replica.String
?
pkg/storage/replica_gc_queue.go, line 117 at r2 (raw file):
func (rgcq *replicaGCQueue) shouldQueue( ctx context.Context, now hlc.Timestamp, repl *Replica, _ *config.SystemConfig, ) (should bool, prio float64) {
nit: ) (shouldQ bool, priority float64) {
seems to be the standard naming.
pkg/storage/replica_gc_queue.go, line 226 at r2 (raw file):
// Maybe it was deleted "under us" by being moved. currentDesc, currentMember := replyDesc.GetReplicaDescriptor(repl.store.StoreID()) if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID == replicaID {
Does every single queue need to perform this check? I thought this was why we put it in baseQueue
. Or is this special because it's checking the current replica descriptor from the meta range?
pkg/storage/replica_gc_queue.go, line 289 at r2 (raw file):
// If we're a member of the current range descriptor then we need to put // down a tombstone that will allow us to get a snapshot. Otherwise put
Why will a tombstone allow us to get a snapshot? Is this corresponding to the case where if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID > replicaID {
and you want to allow the replica in the descriptor to be added to the store once the current Replica
object is GC'ed? Could you elaborate on this a little bit?
pkg/storage/replica_init.go, line 162 at r2 (raw file):
func (r *Replica) setReplicaIDRaftMuLockedMuLocked(replicaID roachpb.ReplicaID) error { if r.mu.replicaID == replicaID {
Now that we're calling this in fewer places, do we even need such a general interface? Can we make this panic in all cases that r.mu.replica != 0
?
pkg/storage/replica_init.go, line 179 at r2 (raw file):
} ctx := r.AnnotateCtx(context.TODO()) if r.mu.replicaID != 0 && replicaID != 0 {
We already checked that replicaID != 0
above.
pkg/storage/replica_init.go, line 187 at r2 (raw file):
} // if r.mu.replicaID != 0 {
We can remove this now.
pkg/storage/replica_init.go, line 268 at r2 (raw file):
defer r.mu.Unlock() // If we raced on the checking the destroyStatus above that's fine as
"on the checking the"
pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):
// Pass true for mayCampaignOnWake because we're about to propose a command. // Ignore removal errors: the proposals which might be resident in this buffer // must also be in the Replica's proposals map and will be notified.
Why is this true? If the proposals in the buffer haven't been flushed then they won't be in the Replica's proposal map. I think FlushLockedWithoutProposing
and its uses might be relevant to answering this question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 18 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)
pkg/storage/replica_application_result.go, line 310 at r1 (raw file):
Previously, ajwerner wrote…
It's here as a hack. The applied state key gets written after the clearing of the data. Maybe other keys do too. This TODO definitely needs to get addressed before merging this change.
You addressed this by not writing the applied state key, right?
pkg/storage/replica_application_state_machine.go, line 530 at r2 (raw file):
return false } _, existsInChange := change.Desc.GetReplicaDescriptor(storeID)
Do we need backwards compat with ChangeReplicas
that only included a DeprecatedUpdatedReplicas
?
pkg/storage/replica_application_state_machine.go, line 601 at r2 (raw file):
ctx, b.batch, b.batch, merge.RightDesc.NextReplicaID, clearRangeIDLocalOnly, mustClearRange, ); err != nil { return wrapWithNonDeterministicFailure(err, "unable to destroy range before merge")
Do you mind s/range/replica/
?
pkg/storage/replica_application_state_machine.go, line 642 at r2 (raw file):
} // Detect if this command will remove us from the range.
nit: move right below the res.Merge
case.
pkg/storage/replica_application_state_machine.go, line 664 at r2 (raw file):
if err := b.r.preDestroyRaftMuLocked( ctx, b.r.Engine(),
s/b.r.Engine()/b.batch/
pkg/storage/replica_application_state_machine.go, line 668 at r2 (raw file):
change.Desc.NextReplicaID, clearAll, true, /* mustClearRange */
Why is this true? Maybe we should rename this to mustUseRangeDelTombstone
or something to avoid confusion.
pkg/storage/replica_application_state_machine.go, line 670 at r2 (raw file):
true, /* mustClearRange */ ); err != nil { return err
return wrapWithNonDeterministicFailure(err, "unable to destroy replica before removal")
pkg/storage/replica_application_state_machine.go, line 930 at r2 (raw file):
// and the first range has a lease that there will not be a later hard-state. if shouldAssert { sm.r.mu.Lock()
I wouldn't tie this to shouldAssert
. It would be better to return a new flag from handleNonTrivialReplicatedEvalResult
.
pkg/storage/replica_application_state_machine.go, line 1068 at r2 (raw file):
sm.batch.changeRemovesReplica
The batch is only a member of replicaStateMachine
to avoid an allocation, so this isn't really kosher. The right way to do this is to either rediscover this fact (like we were doing before, right?) or to annotate the replicatedCmd
.
pkg/storage/replica_application_state_machine.go, line 1096 at r2 (raw file):
return nil } // If we're removed then we know that must have happened due to this
But we wouldn't have even made it here before returning a ErrRemoved
error, right? So can we assert that the replica is never removed?
pkg/storage/replica_destroy.go, line 137 at r2 (raw file):
// created in anticipation of a split in which case we'll clear its data when // the split trigger is applied. func (r *Replica) destroyUninitializedReplicaRaftMuLocked(
Should this return an error that the caller can decide what to do with?
pkg/storage/replica_destroy.go, line 148 at r2 (raw file):
batch, nextReplicaID, clearRangeIDLocalOnly,
Give this a small comment.
pkg/storage/replica_raft.go, line 824 at r2 (raw file):
// above with the apply.ErrRemoved. if isRemoved { return stats, expl, fmt.Errorf("replica removed while processing raft ready")
nit: errors.Errorf
Also, "replica unexpectedly removed ..."
pkg/storage/store.go, line 2166 at r2 (raw file):
// only after we snapshot but for the sake of mixed version clusters and // other potential behavior changes we pass the desc for the known to be // old RHS and have SplitRange clear its data.
Add one more sentence about short-circuiting the rest of this function.
pkg/storage/store.go, line 2180 at r2 (raw file):
// snapshot we couldn't possibly have received. if tombstone.NextReplicaID > rightDesc.ReplicaID { rightRng.mu.minReplicaID = 0
I don't follow from the comment above this why we only set minReplicaID to 0 when the tombstone is past the rightDesc.ReplicaID
.
pkg/storage/store.go, line 2285 at r2 (raw file):
// replica has been removed and re-added before applying this split and we need // to leave it uninitialized. It's not possible for it to have become // initialized
Missing a period.
Also, mention that this method clears the data on the RHS in this case.
pkg/storage/store.go, line 2313 at r2 (raw file):
// NB: We only remove from uninitReplicas and the replicaQueues maps here // so that we don't leave open a window where a replica is temporarily not // present in Store.mu.replicas.
Add to this comment about the case where oldRightDesc != nil
.
pkg/storage/store.go, line 2691 at r2 (raw file):
// TODO(peter): Could release s.mu.Lock() here. s.maybeGossipOnCapacityChange(ctx, rangeRemoveEvent) s.replicaGCQueue.MaybeRemove(rep.RangeID)
Isn't this called by s.scanner.RemoveReplica
?
pkg/storage/store.go, line 3670 at r2 (raw file):
repl.mu.Unlock() if !repl.IsInitialized() { return s.removeUninitializedReplicaRaftMuLocked(ctx, repl, tErr.ReplicaID+1)
nit: pull out nextReplicaID := tErr.ReplicaID+1
above.
pkg/storage/store.go, line 4054 at r2 (raw file):
creatingReplica *roachpb.ReplicaDescriptor, ) (_ *Replica, created bool, _ error) { r := retry.Start(retry.Options{
Leave a comment about why we need a retry loop so we might eventually feel comfortable removing it.
pkg/storage/store.go, line 4113 at r2 (raw file):
repl.mu.Unlock() defer repl.mu.Lock() if !repl.isInitializedRLocked() {
Didn't we just release the lock?
pkg/storage/store.go, line 4148 at r2 (raw file):
var err error if repl.mu.replicaID == 0 { err = repl.setReplicaIDRaftMuLockedMuLocked(replicaID)
What's the point of this? Don't we know this to be a no-op?
pkg/storage/store_snapshot.go, line 740 at r1 (raw file):
Previously, ajwerner wrote…
errRetry should only be returned if the current replica to which this is being addressed has a true value for
IsRemovalPending()
.I'm increasingly convinced that this method could use
withReplicaForRequest
. The logic is very similar and the larger store mutex footprint here doesn't necessarily seem justified. It seems okay to grab the mutex before callingcheckSnapshotOverlapLocked
Where did you land on this?
pkg/storage/store_snapshot.go, line 871 at r1 (raw file):
Previously, ajwerner wrote…
Yes, pending means another goroutine is synchronously removing this replica. I augmented the comment above to clarify why we're checking the destroy status again.
Be a bit more explicit here. We're looking at this again because we released the lock above and are re-acquiring it again.
pkg/storage/store_snapshot.go, line 807 at r2 (raw file):
// intended for existingRepl and if not, potentially destroy it. switch ds.reason { case destroyReasonRemovalPending, destroyReasonRemoved:
Is it worth checking whether the snapshot is intended for the replica ID that is being removed?
pkg/storage/store_snapshot.go, line 816 at r2 (raw file):
return nil case destroyReasonAlive:
nit: stray line
pkg/storage/store_snapshot.go, line 843 at r2 (raw file):
// replicaID) before the snapshot arrives. if okReplicaID { if !existingDesc.IsInitialized() && okReplicaID {
No need to check okReplicaID
again.
pkg/storage/store_snapshot.go, line 860 at r2 (raw file):
} // We think the snapshot is addressed to a higher replica ID which
We think?
pkg/storage/store_snapshot.go, line 880 at r2 (raw file):
} if existingRepl.mu.replicaID == snapReplicaDesc.ReplicaID { if !existingDesc.IsInitialized() {
We have the same logic above and I'm trying to make sense of it. Only if the descriptor isn't initialized do we need to check for overlaps? Why is that? Are we searching for placeholders?
3eada93
to
48bd0b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed some more comments but I still have more I want to do so hold off on taking another pass.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/queue.go, line 62 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: s/value/rangeID/ in its own commit.
I'll move all of the queue changes over to their own commit.
pkg/storage/queue.go, line 1174 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: pull this into a
sameReplica
helper with a little description about the invariant it's meant to uphold.
Done.
pkg/storage/queue.go, line 1174 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This could use a test in
queue_test.go
.
Done.
pkg/storage/replica.go, line 494 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/One/Once/
Also, minor nit but this definition seems a little high in this file. It's above the interface assertions and the constructor. Perhaps move above
Replica.String
?
Done.
pkg/storage/replica_application_result.go, line 310 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You addressed this by not writing the applied state key, right?
Correct.
pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):
Previously, ajwerner wrote…
right I inlined that method here, updated
Done. I moved this whole block down too.
pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):
Previously, ajwerner wrote…
everything underneath application holds the raftMu. Added more commentary.
Done.
pkg/storage/replica_application_state_machine.go, line 642 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move right below the
res.Merge
case.
Doesn't this need to come after the truncated state case? Otherwise we'll write the truncated state to the batch after.
pkg/storage/replica_application_state_machine.go, line 664 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/b.r.Engine()/b.batch/
Done.
pkg/storage/replica_application_state_machine.go, line 668 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this true? Maybe we should rename this to
mustUseRangeDelTombstone
or something to avoid confusion.
I don't know. It felt right given the name. Would you be happy with false?
pkg/storage/replica_application_state_machine.go, line 670 at r2 (raw file):
err, "unable to destroy replica before removal")
Done.
pkg/storage/replica_application_state_machine.go, line 930 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wouldn't tie this to
shouldAssert
. It would be better to return a new flag fromhandleNonTrivialReplicatedEvalResult
.
Done.
pkg/storage/replica_application_state_machine.go, line 1068 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
sm.batch.changeRemovesReplica
The batch is only a member of
replicaStateMachine
to avoid an allocation, so this isn't really kosher. The right way to do this is to either rediscover this fact (like we were doing before, right?) or to annotate thereplicatedCmd
.
Done.
pkg/storage/replica_application_state_machine.go, line 1096 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
But we wouldn't have even made it here before returning a
ErrRemoved
error, right? So can we assert that the replica is never removed?
Sure, assertion added.
pkg/storage/replica_destroy.go, line 137 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this return an error that the caller can decide what to do with?
Done.
pkg/storage/replica_gc_queue.go, line 117 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
) (shouldQ bool, priority float64) {
seems to be the standard naming.
Done.
pkg/storage/replica_gc_queue.go, line 226 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does every single queue need to perform this check? I thought this was why we put it in
baseQueue
. Or is this special because it's checking the current replica descriptor from the meta range?
I had to think about this a bunch but at the end I concluded that we don't need this case. The case that concerned me is where the replicaID
is 0. I did however go back and add logic to improve the purgatory code because it wasn't right. I'm going
pkg/storage/replica_gc_queue.go, line 289 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why will a tombstone allow us to get a snapshot? Is this corresponding to the case where
if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID > replicaID {
and you want to allow the replica in the descriptor to be added to the store once the currentReplica
object is GC'ed? Could you elaborate on this a little bit?
I've reflected on this some and I think we're safe because of the sanity checks in RemoveReplica. Reverted.
pkg/storage/replica_init.go, line 162 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Now that we're calling this in fewer places, do we even need such a general interface? Can we make this panic in all cases that
r.mu.replica != 0
?
Done.
pkg/storage/replica_init.go, line 187 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We can remove this now.
Done.
pkg/storage/replica_init.go, line 268 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"on the checking the"
Done.
pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this true? If the proposals in the buffer haven't been flushed then they won't be in the Replica's proposal map. I think
FlushLockedWithoutProposing
and its uses might be relevant to answering this question.
Wishful thinking.
pkg/storage/replica_raft.go, line 824 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
errors.Errorf
Also, "replica unexpectedly removed ..."
Done.
pkg/storage/store.go, line 2166 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add one more sentence about short-circuiting the rest of this function.
Done.
pkg/storage/store.go, line 2180 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't follow from the comment above this why we only set minReplicaID to 0 when the tombstone is past the
rightDesc.ReplicaID
.
If the tombstone is at or below rightDesc.ReplicaID then we don't need to touch minReplicaID.
I added some more commentary
pkg/storage/store.go, line 2285 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Missing a period.
Also, mention that this method clears the data on the RHS in this case.
Done.
pkg/storage/store.go, line 2313 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add to this comment about the case where
oldRightDesc != nil
.
Done.
pkg/storage/store.go, line 2691 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Isn't this called by
s.scanner.RemoveReplica
?
Indeed. Removed.
pkg/storage/store.go, line 3670 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: pull out
nextReplicaID := tErr.ReplicaID+1
above.
Done.
pkg/storage/store.go, line 4054 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Leave a comment about why we need a retry loop so we might eventually feel comfortable removing it.
The retry loop isn't new but sure, added.
pkg/storage/store.go, line 4113 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Didn't we just release the lock?
Fixed.
pkg/storage/store.go, line 4148 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What's the point of this? Don't we know this to be a no-op?
No, this is the case where a replica learns about its replica ID. This is common in the case of a pre-emptive snapshot.
pkg/storage/store_snapshot.go, line 740 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Where did you land on this?
I want to do it.
pkg/storage/store_snapshot.go, line 807 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it worth checking whether the snapshot is intended for the replica ID that is being removed?
Sure, we could return the error here. THis case is
pkg/storage/store_snapshot.go, line 860 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We think?
It's possible that it's not because we're not currently holding the mutex.
pkg/storage/store_snapshot.go, line 880 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We have the same logic above and I'm trying to make sense of it. Only if the descriptor isn't initialized do we need to check for overlaps? Why is that? Are we searching for placeholders?
If the descriptor is initialized and covers more ranges then we'll subsume them. We need to accept a wider snapshot assuming we're in the range at this point.
48bd0b5
to
6b42756
Compare
Before this PR we would write such a tombstone when detecting that a range had been merged via a snapshot or via the replica gc queue but curiously not when merging the range by applying a merge. Release Justification: Came across this oddity while working on updating tests for cockroachdb#40751. Maybe is not justified. Release note: None
6b42756
to
fbb46bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Progress continues but it's still not quite there yet. On the bright side I am able to produce a number of failures by performing a version upgrade during TestRepartitioning:
Over a long running period this should give us some added confidence. We should also stress the version upgrade tests.
Would it be helpful to split this change into ~3 commits.
- Make queues replica ID aware
- Ensure that replica IDs don't change
- Ensure that splits and merges are only processed by replicas in the range
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):
Previously, ajwerner wrote…
Wishful thinking.
Returning the error now. This should be the right thing.
pkg/storage/store_snapshot.go, line 740 at r1 (raw file):
Previously, ajwerner wrote…
I want to do it.
This turns out to be a major simplification and I think just works.
fbb46bc
to
2358606
Compare
2358606
to
9d8507e
Compare
Temporarily messed the diff up. |
WARNING: this change needs more testing to target the changes it makes and at least some of the disabled tests should be reworked. This is a big and scary change at this point in the cycle so I'm getting it out before I'm really happy with it. There are some known TODOs. On the plus side it does not seem to reproduce any crashes in hours with the `partitionccl.TestRepartitioning` which readily produces crashes on master when run under roachprod stress within ~20 minutes. We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. * The queues are now replica ID aware. If a replica was added to the queue and the replica found when trying to pop are not the same and we knew the replica ID of replica when we added it then we should not process it. This hopefully gives us some new invariants: * There is only ever at most 1 *Replica which IsAlive() for a range on a store at a time. * Once a *Replica has a non-zero ReplicaID is never changes. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Fix crashes by preventing replica ID change.
9d8507e
to
6f7c3f2
Compare
This changes the partitioning tests to use 9 replicas rather than 3 and specifies lease_preferences in addition to constraints to allow the same types of scans to work. This test was intended to help catch errors cockroachdb#40751 but somehow seems to reproduce cockroachdb#40213 even on master. ``` I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data I190918 05:26:52.501282 53874 storage/store.go:2613 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1 I190918 05:26:52.502818 53874 storage/replica_destroy.go:152 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms] I190918 05:26:56.941153 80013 storage/replica_command.go:1580 [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26] F190918 05:26:57.019170 154 storage/store.go:4053 [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]} ``` This reproduction occurs readily on this branch running roachprod stress: ``` make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning ``` Release note: None
This changes the partitioning tests to use 9 replicas rather than 3 and specifies lease_preferences in addition to constraints to allow the same types of scans to work. This test was intended to help catch errors cockroachdb#40751 but somehow seems to reproduce cockroachdb#40213 even on master. ``` I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data I190918 05:26:52.501282 53874 storage/store.go:2613 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1 I190918 05:26:52.502818 53874 storage/replica_destroy.go:152 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms] I190918 05:26:56.941153 80013 storage/replica_command.go:1580 [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26] F190918 05:26:57.019170 154 storage/store.go:4053 [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]} ... ``` This reproduction occurs readily on this branch running roachprod stress: ``` make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning ``` Release note: None
…arner This commit adds an annotation to raft request messages to indicate that the sender believes the current replica is a learner. If the current replica on the recipient was created as a preemptive snapshot (it's initialized but not in the range) then we should remove that replica immediately. This commit also contains some testing of scenarios where we upgrade from a preemptive snapshot. Release note: None
6f7c3f2
to
966bce6
Compare
This changes the partitioning tests to use 9 replicas rather than 3 and specifies lease_preferences in addition to constraints to allow the same types of scans to work. This test was intended to help catch errors cockroachdb#40751 but somehow seems to reproduce cockroachdb#40213 even on master. ``` I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data I190918 05:26:52.501282 53874 storage/store.go:2613 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1 I190918 05:26:52.502818 53874 storage/replica_destroy.go:152 [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms] I190918 05:26:56.941153 80013 storage/replica_command.go:1580 [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26] F190918 05:26:57.019170 154 storage/store.go:4053 [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]} ... ``` This reproduction occurs readily on this branch running roachprod stress: ``` make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning ``` Release note: None
Closed in favor of #40892. |
Before this PR we would write such a tombstone when detecting that a range had been merged via a snapshot or via the replica gc queue but curiously not when merging the range by applying a merge. Release Justification: Came across this oddity while working on updating tests for cockroachdb#40751. Release note: None
40814: storage: write a tombstone with math.MaxInt32 when processing a merge r=ajwerner a=ajwerner Before this PR we would write such a tombstone when detecting that a range had been merged via a snapshot or via the replica gc queue but curiously not when merging the range by applying a merge. Release Justification: Came across this oddity while working on updating tests for #40751. Maybe is not justified. Release note: None Co-authored-by: Andrew Werner <[email protected]>
WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.
On the plus side it does not seem to reproduce any crashes in hours with the
partitionccl.TestRepartitioning
which readily produces crashes on masterwhen run under roachprod stress within ~20 minutes.
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.
Here's a high level overview of the change:
change.
the current replica will be removed completely.
completely removes itself while applying that command.
of a Replica. Replicas about to be synchronously removed are in
destroyReasonRemovalPending.
and the replica found when trying to pop are not the same and we knew the
replica ID of replica when we added it then we should not process it.
This hopefully gives us some new invariants:
at a time.
The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.
Fixes #40367.
Release justification: This commit is safe for 19.2 because it fixes release
blockers.
Release note (bug fix): Fix crashes by preventing replica ID change.